Association Mining

Finding Patterns in Data | Lab Session, BZAN 542

Context

In this lab, I will use the (anonymised) data from a project that I worked on in 2019.

CK is an Indian food and beverage chain with about 19 outlets in 5 cities. Their outlets are popular “hangout” places for young and old alike. People often go there to meet friends or family, or just to get their Chai-tea or coffee. Imagine a cafe, basically.

Their prices are not low by Indian standards, but they aren’t a luxury chain either. They offer about 100 items at their stores, though only about 20 generate most of the revenue.

Their two most popular items are Chai (tea) and Coffee (which they like to call Kaapi). Chai comes in several types, depending on the spice in it. With ginger (Adrak), for example, it is called Adrak Chai. The table below lists some popular items with short descriptions.

Item | Description
Adrak Chai / Kadak Chai / Elaichi Chai / other types of Chai | Chai-tea with ginger / Chai-tea with strong spices / Chai-tea with cardamom / etc.
Kulhad Chai | Chai-tea served in an earthen pot. Popular in Northern India, especially New Delhi
Indian Filter Kaapi | Filter coffee, popular in Southern India
Paneer Puff | A croissant-like bread filled with Paneer (Indian cottage cheese)
Veg Club Sandwich | Vegetarian sandwich with grated vegetables, cheese, etc.
Maska Bun | Bread and butter; commonly eaten with Chai
Biryani | A slow-cooked rice dish made with Basmati rice, spices and a choice of meat or vegetables

Data Analysis

Loading Packages and Setting Working Directory

The tidyverse handles data manipulation and visualisation; arules and arulesViz handle association rule mining and its visualisation. I like theme_clean() from the ggthemes package.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6.9000     ✔ purrr   0.3.4     
## ✔ tibble  3.1.7          ✔ dplyr   1.0.9     
## ✔ tidyr   1.2.0          ✔ stringr 1.4.1     
## ✔ readr   2.1.2          ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
library(arulesViz)
theme_set(ggthemes::theme_clean())

Loading Data

You can load the CSV data and then convert it to the list format that the arules package expects: one character vector of item names per invoice. The loop below takes about 3 minutes to run.

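# NOTE: the link given in the original lab,
# https://github.com/harshvardhaniimi/bzan-542/blob/b43cad7a71a241a0ffa11e4df369ce64fbfb54a4/Association%20Mining%20LAB/CK_data_anon.RDS
# points to the processed .RDS file, which read_csv() cannot parse.
# csv_path is a placeholder: set it to wherever your copy of the raw CSV lives.
csv_path = "path/to/raw_data.csv"
df = read_csv(csv_path) %>% 
   janitor::clean_names()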

df1 = df %>% 
   select(invoice_name, item_name)


invoices = unique(df1$invoice_name)

all_items = list()

for (i in invoices)
{
   l = df1 %>% 
      filter(invoice_name == i) %>% 
      pull(item_name) %>% 
      as.character()
   
   all_items = append(all_items, list(l))
}
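
The loop filters the whole data frame once per invoice, which is why it is slow. As a hedged alternative, base R's split() can group the items by invoice in one pass. The sketch below uses a small made-up data frame (df1_toy is hypothetical; only the two column names come from the lab) so that it runs without the real data.

```r
# Toy stand-in for df1: made-up rows, same two columns as in the lab
df1_toy = data.frame(
  invoice_name = c("INV1", "INV1", "INV2", "INV3", "INV3", "INV3"),
  item_name    = c("Kadak Chai", "Maska Bun", "Adrak Chai",
                   "Kadak Chai", "Paneer Puff", "Water Bottle 500 ML")
)

# split() groups the item names by invoice in a single pass
all_items_toy = split(as.character(df1_toy$item_name), df1_toy$invoice_name)

# Drop the invoice names to match the unnamed list built by the loop
all_items_toy = unname(all_items_toy)
all_items_toy
```

One difference to keep in mind: split() orders the invoices by sorted invoice name rather than by first appearance. That should not matter for mining itemsets, since the order of transactions is not used.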

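Or, you can directly import the list file I created for you after processing it. If you built all_items with the loop above instead, assign df = all_items, since the rest of the lab expects df to be that list.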

df = readRDS("CK_data_anon.RDS")

Getting Ready for Analysis

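arules works on a transactions object, which can be built from a list holding one vector of items per transaction. See ?transactions for more details.

Convert the list df to a transactions object. The warning below means that an item repeated within one invoice is kept only once.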

trans = transactions(df)
## Warning in asMethod(object): removing duplicated items in transactions

Let’s see a summary of what we have.

summary(trans)
## transactions as itemMatrix in sparse format with
##  56737 rows (elements/itemsets/transactions) and
##  211 columns (items) and a density of 0.00928914 
## 
## most frequent items:
##          Kadak Chai Water Bottle 500 ML          Adrak Chai Indian Filter Kaapi 
##               13910               10986                9748                8935 
##        Elaichi Chai             (Other) 
##                3301               64325 
## 
## element (itemset/transaction) length distribution:
## sizes
##     1     2     3     4     5     6     7     8     9    10    11    12    13 
## 24890 18374  7980  3315  1361   508   153    87    28    16    11     6     3 
##    17    19    20    30 
##     2     1     1     1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00    2.00    1.96    2.00   30.00 
## 
## includes extended item information - examples:
##            labels
## 1       Aam Panna
## 2      Adrak Chai
## 3 Adrak Chai Full

Let’s look at the most frequent items. The y-axis shows the support, i.e. the share of transactions that contain the item.

itemFrequencyPlot(trans,topN = 20)

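Another way to visualise the data: items ranked from most to least frequent on the x-axis, with the absolute support (number of transactions) on the y-axis.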

ggplot(
  tibble(
    Support = sort(itemFrequency(trans, type = "absolute"), decreasing = TRUE),
    Item = seq_len(ncol(trans))
  ), aes(x = Item, y = Support)) + geom_line()

A handful of items are very popular, and the support falls off quickly for the rest.

Number of Possible Associations

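For this dataset, the number of possible itemsets is huge. But how much exactly? Each of the 211 items is either in or out of an itemset, which gives 2^211 combinations (counting the empty set).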

2^ncol(trans)
## [1] 3.291009e+63

Woah.
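
The count above is for itemsets. The number of possible rules is larger still: for d items, the standard count of rules with a non-empty left-hand and right-hand side is 3^d - 2^(d+1) + 1. A quick sketch, with function names of my own choosing:

```r
# d = number of distinct items
n_itemsets = function(d) 2^d - 1              # non-empty itemsets
n_rules    = function(d) 3^d - 2^(d + 1) + 1  # rules with non-empty lhs and rhs

n_rules(6)    # small case: 602 rules from only 6 items
n_rules(211)  # the 211 items in this lab
```

This is why the algorithms below prune by minimum support and confidence rather than enumerating everything.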

Frequent Itemsets

Let’s try to find the frequent itemsets.

its = apriori(trans, parameter=list(target = "frequent"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##          NA    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen            target  ext
##      10 frequent itemsets TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 5673 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[211 item(s), 56737 transaction(s)] done [0.00s].
## sorting and recoding items ... [4 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## sorting transactions ... done [0.00s].
## writing ... [4 set(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
its
## set of 4 itemsets

Support is a parameter that needs to be tuned; the default minimum support of 0.1 leaves only 4 itemsets here. To see all parameters that can be set, see ?ASparameter.

The lower the support parameter, the higher the number of itemsets you can generate. For large datasets, you should start from higher support values and make your way down. In this case, I tried several values and found 0.1 gave me 4 itemsets, 0.01 gave me 52 itemsets, 0.005 gave me 104 itemsets, and 0.001 gave me 440 itemsets.
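
To see why lowering the threshold grows the result, here is a brute-force sketch on a toy dataset. The five transactions are made up, and the code enumerates every itemset, which is only feasible for a handful of items; it is not how apriori() works internally.

```r
# Toy transactions (made-up data, not from the CK dataset)
toy = list(
  c("Chai", "Bun"),
  c("Chai", "Water"),
  c("Chai", "Bun", "Water"),
  c("Kaapi"),
  c("Chai")
)
items = sort(unique(unlist(toy)))

# Enumerate every non-empty itemset of the 4 items (15 candidates)
candidates = unlist(
  lapply(seq_along(items), function(k) combn(items, k, simplify = FALSE)),
  recursive = FALSE
)

# Support = share of transactions that contain all items of the set
support_of = function(set) {
  sum(sapply(toy, function(t) all(set %in% t))) / length(toy)
}
supports = sapply(candidates, support_of)

# Number of frequent itemsets at decreasing thresholds
thresholds = c(0.6, 0.4, 0.2)
n_frequent = sapply(thresholds, function(s) sum(supports >= s))
names(n_frequent) = thresholds
n_frequent
```

On this toy data the count goes from 1 to 5 to 8 as the threshold drops from 0.6 to 0.4 to 0.2, the same pattern as the 4, 52, 104 and 440 itemsets reported above.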

It will be your call to choose the right value of support.

its = apriori(trans, parameter=list(target = "frequent", support = 0.001))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##          NA    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen            target  ext
##      10 frequent itemsets TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 56 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[211 item(s), 56737 transaction(s)] done [0.00s].
## sorting and recoding items ... [123 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.00s].
## sorting transactions ... done [0.01s].
## writing ... [440 set(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
its
## set of 440 itemsets

Let’s see what we find.

its = sort(its, by = "support")
inspect(head(its, n = 10))
##      items                             support    count
## [1]  {Kadak Chai}                      0.24516629 13910
## [2]  {Water Bottle 500 ML}             0.19363026 10986
## [3]  {Adrak Chai}                      0.17181028  9748
## [4]  {Indian Filter Kaapi}             0.15748101  8935
## [5]  {Elaichi Chai}                    0.05818073  3301
## [6]  {Lemon Ice Tea}                   0.04642473  2634
## [7]  {Kadak Chai, Water Bottle 500 ML} 0.04379858  2485
## [8]  {Masala Chai}                     0.03985054  2261
## [9]  {Paneer Puff}                     0.03831715  2174
## [10] {Extra Cheese Grated}             0.03646650  2069

Let’s see how many items are bought together, i.e. the distribution of itemset sizes.

ggplot(tibble(`Itemset Size` = factor(size(its))), aes(`Itemset Size`)) + geom_bar()

Most itemsets are of size two, followed by single items.

What items are consumed in groups of three?

inspect(its[size(its) == 3])
##      items                             support count
## [1]  {Indian Filter Kaapi,                          
##       Kadak Chai,                                   
##       Water Bottle 500 ML}         0.005587183   317
## [2]  {Adrak Chai,                                   
##       Kadak Chai,                                   
##       Water Bottle 500 ML}         0.004617798   262
## [3]  {Adrak Chai,                                   
##       Extra Elaichi Flavor,                         
##       Water Bottle 500 ML}         0.003771789   214
## [4]  {Adrak Chai,                                   
##       Indian Filter Kaapi,                          
##       Water Bottle 500 ML}         0.003525037   200
## [5]  {Kadak Chai,                                   
##       Paneer Puff,                                  
##       Water Bottle 500 ML}         0.002485151   141
## [6]  {Adrak Chai,                                   
##       Kadak Chai,                                   
##       Maska Bun}                   0.002220773   126
## [7]  {Elaichi Chai,                                 
##       Kadak Chai,                                   
##       Water Bottle 500 ML}         0.002185523   124
## [8]  {Kadak Chai,                                   
##       Maska Bun,                                    
##       Water Bottle 500 ML}         0.002150272   122
## [9]  {Adrak Chai,                                   
##       Extra Cheese Grated,                          
##       Water Bottle 500 ML}         0.002150272   122
## [10] {CK Sandwich,                                  
##       Extra Cheese Grated,                          
##       Water Bottle 500 ML}         0.002115022   120
## [11] {Extra Adrak Flavor,                           
##       Kadak Chai,                                   
##       Maska Bun}                   0.001956395   111
## [12] {Adrak Chai,                                   
##       Elaichi Chai,                                 
##       Water Bottle 500 ML}         0.001885895   107
## [13] {Extra Cheese Grated,                          
##       Veg Club,                                     
##       Water Bottle 500 ML}         0.001850644   105
## [14] {Adrak Chai,                                   
##       Extra Cheese Grated,                          
##       Extra Elaichi Flavor}        0.001833019   104
## [15] {Bana Ke,                                      
##       Paneer Puff,                                  
##       Water Bottle 500 ML}         0.001797769   102
## [16] {Exotic Corn Mayo,                             
##       Extra Cheese Grated,                          
##       Water Bottle 500 ML}         0.001709643    97
## [17] {Indian Filter Kaapi,                          
##       Indian Filter Kaapi Large,                    
##       Water Bottle 500 ML}         0.001639142    93
## [18] {Adrak Chai,                                   
##       Maska Bun,                                    
##       Water Bottle 500 ML}         0.001603892    91
## [19] {Adrak Chai,                                   
##       CK Sandwich,                                  
##       Water Bottle 500 ML}         0.001533391    87
## [20] {CK Sandwich,                                  
##       Kadak Chai,                                   
##       Water Bottle 500 ML}         0.001533391    87
## [21] {Adrak Chai,                                   
##       CK Sandwich,                                  
##       Extra Cheese Grated}         0.001498141    85
## [22] {Indian Filter Kaapi Large,                    
##       Kadak Chai,                                   
##       Water Bottle 500 ML}         0.001462890    83
## [23] {Adrak Chai,                                   
##       Extra Elaichi Flavor,                         
##       Indian Filter Kaapi}         0.001462890    83
## [24] {Bana Ke,                                      
##       Kadak Chai,                                   
##       Paneer Puff}                 0.001445265    82
## [25] {Adrak Chai,                                   
##       Indian Filter Kaapi,                          
##       Kadak Chai}                  0.001392389    79
## [26] {Indian Filter Kaapi,                          
##       Paneer Puff,                                  
##       Water Bottle 500 ML}         0.001304264    74
## [27] {Indian Filter Kaapi,                          
##       Kadak Chai,                                   
##       Maska Bun}                   0.001198512    68
## [28] {Adrak Chai,                                   
##       Italian Noodles,                              
##       Water Bottle 500 ML}         0.001198512    68
## [29] {Indori Upma,                                  
##       Kadak Chai,                                   
##       Water Bottle 500 ML}         0.001163262    66
## [30] {Extra Cheese Grated,                          
##       Water Bottle 500 ML,                          
##       White Sauce Pasta}           0.001163262    66
## [31] {Adrak Chai,                                   
##       Extra Adrak Flavor,                           
##       Kadak Chai}                  0.001145637    65
## [32] {Adrak Chai,                                   
##       Cheese Chutney,                               
##       Water Bottle 500 ML}         0.001145637    65
## [33] {Adrak Chai,                                   
##       Extra Cheese Grated,                          
##       Veg Club}                    0.001128012    64
## [34] {Adrak Chai,                                   
##       Extra Adrak Flavor,                           
##       Maska Bun}                   0.001110387    63
## [35] {Adrak Chai,                                   
##       French Fries – Piri Piri,                   
##       Water Bottle 500 ML}         0.001110387    63
## [36] {Adrak Chai,                                   
##       Masala Chai,                                  
##       Water Bottle 500 ML}         0.001110387    63
## [37] {Adrak Chai,                                   
##       Elaichi Chai,                                 
##       Kadak Chai}                  0.001092761    62
## [38] {Extra Cheese Grated,                          
##       Kadak Chai,                                   
##       Water Bottle 500 ML}         0.001092761    62
## [39] {Adrak Chai,                                   
##       Paneer Puff,                                  
##       Water Bottle 500 ML}         0.001075136    61
## [40] {Extra Cheese Grated,                          
##       Indian Filter Kaapi,                          
##       Water Bottle 500 ML}         0.001075136    61
## [41] {Adrak Chai,                                   
##       Veg Club,                                     
##       Water Bottle 500 ML}         0.001057511    60
## [42] {Adrak Chai,                                   
##       Chilli Garlic Cheese Toast,                   
##       Water Bottle 500 ML}         0.001004635    57

What items are consumed in groups of four or more?

inspect(its[size(its) > 3])
##     items                     support count
## [1] {Adrak Chai,                           
##      Extra Adrak Flavor,                   
##      Kadak Chai,                           
##      Maska Bun}           0.001022261    58

What are the business implications of these?

  • Water Bottle 500 ML appears to be sold alongside many other items. As a business, consider offering it as a discounted pair. For example, if a bottle of water costs $5 on its own, it could cost $3 when bought with a Chai.
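
One caveat before acting on this: a popular item shows up in many itemsets simply because it is popular. A rough check is to compare the observed support of a pair with what independence would predict (the ratio is the lift, which appears again in the rules section). The sketch below uses the counts printed above for Kadak Chai and Water Bottle 500 ML; the variable names are mine.

```r
n           = 56737  # transactions
count_kadak = 13910  # {Kadak Chai}
count_water = 10986  # {Water Bottle 500 ML}
count_both  = 2485   # {Kadak Chai, Water Bottle 500 ML}

observed = count_both / n                        # support of the pair
expected = (count_kadak / n) * (count_water / n) # support if independent

lift = observed / expected
round(c(observed = observed, expected = expected, lift = lift), 4)
```

The lift comes out a little below 1, so this particular pair co-occurs about as often as chance would suggest, or slightly less. That does not rule out a bundle offer, but it suggests checking lift, not just support, before choosing which pairs to discount.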

Representing Itemsets

Maximal Itemsets

The itemsets found above include both itemsets and their supersets. Reporting both is often redundant from a business point of view.

For example, consider the itemset {Adrak Chai, Maska Bun, Water Bottle 500 ML}. If we include it, should we also include {Adrak Chai, Water Bottle 500 ML}? Probably not.

The function is.maximal() (see ?is.maximal) flags the itemsets that have no proper superset among the frequent itemsets; indexing with it keeps only those.

its_max = its[is.maximal(its)]
its_max
## set of 309 itemsets

Let’s look at them.

inspect(head(its_max, by = "support"))
##     items                      support count
## [1] {Employee Meal,                         
##      Kadak Chai}           0.018700319  1061
## [2] {Sultan’s Kaapi}     0.008389587   476
## [3] {Lemon Ice Tea,                         
##      Water Bottle 500 ML}  0.008336711   473
## [4] {Orange Slush}         0.006133564   348
## [5] {Cinnamon Kaapi}       0.005851561   332
## [6] {Indian Filter Kaapi,                   
##      Kadak Chai,                            
##      Water Bottle 500 ML}  0.005587183   317
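
To make the maximality check concrete, here is a base R sketch on a hand-picked handful of itemsets. The list is illustrative and is not the full output above; is_proper_subset() is a helper I define here, not an arules function.

```r
# A few itemsets, hand-picked for illustration
frequent = list(
  c("Adrak Chai"),
  c("Water Bottle 500 ML"),
  c("Adrak Chai", "Water Bottle 500 ML"),
  c("Adrak Chai", "Maska Bun", "Water Bottle 500 ML"),
  c("Paneer Puff")
)

# TRUE when every item of a is in b and b has more items
is_proper_subset = function(a, b) length(a) < length(b) && all(a %in% b)

# An itemset is maximal if it is not a proper subset of any other itemset
is_max = sapply(frequent, function(a) {
  !any(sapply(frequent, function(b) is_proper_subset(a, b)))
})
frequent[is_max]
```

Only the three-item set and {Paneer Puff} survive: every other itemset is contained in a larger one.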

Association Rule Mining

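These rules are to be interpreted as If This Then That (IFTTT): if a basket contains the items on the left-hand side (lhs), it tends to also contain the item on the right-hand side (rhs).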

rules = apriori(trans, parameter = list(support = 0.001, confidence = 0.2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.2    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 56 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[211 item(s), 56737 transaction(s)] done [0.00s].
## sorting and recoding items ... [123 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [131 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
length(rules)
## [1] 131
inspect(head(rules))
##     lhs                             rhs                  support     confidence
## [1] {}                           => {Kadak Chai}         0.245166294 0.2451663 
## [2] {Kulladh Chai}               => {Extra Adrak Flavor} 0.002132647 0.5845411 
## [3] {Extra Adrak Flavor}         => {Kulladh Chai}       0.002132647 0.2494845 
## [4] {Adrak Chai Full}            => {Adrak Chai}         0.001039886 0.2243346 
## [5] {Extra Cheese Slice}         => {CK Tadka Burger}    0.001004635 0.2968750 
## [6] {Garlic Butter Bread Spread} => {Adrak Chai}         0.001833019 0.3623693 
##     coverage    lift      count
## [1] 1.000000000  1.000000 13910
## [2] 0.003648413 68.381662   121
## [3] 0.008548214 68.381662   121
## [4] 0.004635423  1.305711    59
## [5] 0.003384035 48.263028    57
## [6] 0.005058427  2.109125   104

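Rule [1] has an empty left-hand side, which apriori() allows because minlen defaults to 1; its confidence is simply the support of Kadak Chai. Let’s look at the quality measures on their own.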

quality(head(rules))
##       support confidence    coverage      lift count
## 1 0.245166294  0.2451663 1.000000000  1.000000 13910
## 2 0.002132647  0.5845411 0.003648413 68.381662   121
## 3 0.002132647  0.2494845 0.008548214 68.381662   121
## 4 0.001039886  0.2243346 0.004635423  1.305711    59
## 5 0.001004635  0.2968750 0.003384035 48.263028    57
## 6 0.001833019  0.3623693 0.005058427  2.109125   104
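
These measures can be reproduced by hand. Support is the share of transactions containing both sides, confidence is that support divided by the support of the lhs, and lift is confidence divided by the support of the rhs. The sketch below redoes rules [2] and [3]; the two single-item counts are not printed above, so I recover them from the coverage column (an assumption that rounding gives the exact counts).

```r
n          = 56737  # transactions
count_both = 121    # Kulladh Chai and Extra Adrak Flavor together (count column)

# Single-item counts recovered from coverage * n
count_kulladh = round(0.003648413 * n)  # coverage of rule [2]
count_extra   = round(0.008548214 * n)  # coverage of rule [3]

support    = count_both / n
confidence = count_both / count_kulladh   # {Kulladh Chai} => {Extra Adrak Flavor}
conf_rev   = count_both / count_extra     # {Extra Adrak Flavor} => {Kulladh Chai}
lift       = confidence / (count_extra / n)

round(c(support = support, confidence = confidence,
        conf_rev = conf_rev, lift = lift), 5)
```

Confidence depends on the direction of the rule, while lift is symmetric. That is why rules [2] and [3] have different confidence but the same lift.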

Rules with the highest lift

rules = sort(rules, by = "lift")
inspect(head(rules, n = 10))
##      lhs                      rhs                      support confidence    coverage     lift count
## [1]  {Kulladh Chai}        => {Extra Adrak Flavor} 0.002132647  0.5845411 0.003648413 68.38166   121
## [2]  {Extra Adrak Flavor}  => {Kulladh Chai}       0.002132647  0.2494845 0.008548214 68.38166   121
## [3]  {Adrak Chai,                                                                                   
##       Kadak Chai,                                                                                   
##       Maska Bun}           => {Extra Adrak Flavor} 0.001022261  0.4603175 0.002220773 53.84955    58
## [4]  {Extra Cheese Slice}  => {CK Tadka Burger}    0.001004635  0.2968750 0.003384035 48.26303    57
## [5]  {Adrak Chai,                                                                                   
##       Extra Adrak Flavor,                                                                           
##       Kadak Chai}          => {Maska Bun}          0.001022261  0.8923077 0.001145637 37.30793    58
## [6]  {Extra Adrak Flavor,                                                                           
##       Kadak Chai}          => {Maska Bun}          0.001956395  0.7872340 0.002485151 32.91474   111
## [7]  {Kadak Chai,                                                                                   
##       Paneer Puff}         => {Bana Ke}            0.001445265  0.2029703 0.007120574 28.22531    82
## [8]  {Bana Ke,                                                                                      
##       Kadak Chai}          => {Paneer Puff}        0.001445265  0.8817204 0.001639142 23.01112    82
## [9]  {Bana Ke,                                                                                      
##       Water Bottle 500 ML} => {Paneer Puff}        0.001797769  0.8571429 0.002097397 22.36969   102
## [10] {Bana Ke}             => {Paneer Puff}        0.005745810  0.7990196 0.007191075 20.85279   326

Visualisation

You can also visualise the rules you created, thanks to the arulesViz package.

plot(rules)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

Scatter plot shaded by the order of each rule (the number of items in it).

plot(rules, shading = "order")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

Grouped plot

plot(rules, method = "grouped")

Graph plot

plot(rules, method = "graph")
## Warning: Too many rules supplied. Only plotting the best 100 using
## 'lift' (change control parameter max if needed).
## Warning: ggrepel: 6 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

That is too many rules to read in one graph. Let’s raise the minimum confidence from 0.2 to 0.4 to get fewer rules.

rules = apriori(trans, parameter = list(support = 0.001, confidence = 0.4))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.4    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 56 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[211 item(s), 56737 transaction(s)] done [0.00s].
## sorting and recoding items ... [123 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [26 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
plot(rules, method = "graph")

Interactive Table and Visualisation

You can also see the rules interactively.

Table of Rules

inspectDT(rules)

Plot of Rules

plot(rules, engine = "html")

Matrix of Rules

plot(rules, method = "matrix", engine = "html") 

Graph of Rules

plot(rules, method = "graph", engine = "html")

Single-shot Analysis

You can also pass the transactions directly and explore the mined rules in an interactive app.

ruleExplorer(trans)

Reference

A large part of this tutorial follows the book chapter “Association Analysis: Basic Concepts and Algorithms” from Introduction to Data Mining by Tan, Steinbach and Kumar.